199 research outputs found

    RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models

    Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Availability: Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online

    Do Phylogenetic Tree Viewers correctly display Support Values?[Reprint]

    Phylogenetic trees are routinely visualized to present and interpret the evolutionary relationships of the species that are being studied. Virtually all empirical evolutionary data studies contain a visualization of the inferred tree with support values using one of the popular and highly cited (e.g., TreeView, Dendroscope, FigTree, Archaeopteryx, etc.) tree viewing tools. As a consequence, programming errors or ambiguous semantics in tree file formats can lead to erroneous tree visualizations and consequently incorrect interpretations of phylogenetic analyses. Here, we discuss the problems that can and do arise when displaying branch support values on trees. Presumably for historical reasons, branch support values (e.g., bootstrap support or Bayesian posterior probabilities) are typically stored as node labels in the widely-used Newick tree format. However, support values are attributes of branches (bipartitions) in unrooted phylogenetic trees. Therefore, storing support values as node labels can potentially lead to incorrect support-value-to-bipartition mappings when re-rooting trees in tree viewers. This depends on the mostly implicit semantics of tree viewers for interpreting node labels. To assess the potential impact of these ambiguous and predominantly implicit semantics of support values, we analyzed 10 distinct tree viewers. We find that most of them exhibit some sort of incorrect or unexpected behavior when re-rooting trees with support values. We find that Dendroscope interprets Newick node labels as simply that, node labels in Newick trees. However, if they are meant to represent branch support values, the support value to branch mapping is incorrect when re-rooting trees with Dendroscope. We illustrate such an incorrect mapping by example of an empirical phylogenetic study.
As a solution, we suggest that (i) branch support values should exclusively be stored as meta-data associated to branches (and not nodes), and (ii) if this is not feasible, tree viewers should include a user dialogue that explicitly forces users to define if node labels shall be interpreted as node or branch labels, prior to tree visualization
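
The mapping problem described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the five-taxon tree (((A,B)90,C)70,D,E);, the inner node names X, Y and R, and the function names are all assumptions made for illustration. The sketch keeps the labels attached to nodes and reports which bipartition each label ends up describing once the tree is walked from a different root.

```python
from collections import defaultdict

# Hypothetical unrooted tree (((A,B)90,C)70,D,E); stored as an undirected graph.
# X, Y and R are made-up names for the inner nodes; 90 and 70 are node labels.
EDGES = [("X", "A"), ("X", "B"), ("Y", "X"), ("Y", "C"),
         ("R", "Y"), ("R", "D"), ("R", "E")]
NODE_LABELS = {"X": 90, "Y": 70}
LEAVES = frozenset("ABCDE")


def build_adjacency(edges):
    """Turn an edge list into a neighbour lookup."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def leaves_below(adj, node, parent):
    """Leaves reachable from `node` without crossing back to `parent`."""
    if node in LEAVES:
        return frozenset([node])
    found = frozenset()
    for nxt in adj[node]:
        if nxt != parent:
            found |= leaves_below(adj, nxt, node)
    return found


def label_to_bipartition(edges, root):
    """Map each split to the label of the node sitting below that branch."""
    adj = build_adjacency(edges)
    mapping = {}

    def walk(node, parent):
        if node in NODE_LABELS and parent is not None:
            side = leaves_below(adj, node, parent)
            mapping[frozenset([side, LEAVES - side])] = NODE_LABELS[node]
        for nxt in adj[node]:
            if nxt != parent:
                walk(nxt, node)

    walk(root, None)
    return mapping


ab_split = frozenset([frozenset("AB"), frozenset("CDE")])
original = label_to_bipartition(EDGES, "R")   # labels as written in the file
rerooted = label_to_bipartition(EDGES, "A")   # root moved to the branch leading to A
print(original[ab_split], rerooted[ab_split])
```

In this toy case the split AB|CDE carries the value 90 in the original orientation but 70 after naive re-rooting, because the label stays on the node while the node's parent branch changes. Storing the values on branches, as the abstract suggests, avoids this.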

    Root Digger: a root placement program for phylogenetic trees

    Background: In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can also be used to compute the likelihood of a potential root position. Results: We present a software tool called RootDigger which uses a non-reversible Markov model to compute the most likely root location on a given tree and to infer a confidence value for each possible root placement. We find that RootDigger is successful at finding roots when compared to similar tools such as IQ-TREE and MAD, and will occasionally outperform them. Additionally, we find that the exhaustive mode of RootDigger is useful in quantifying and explaining uncertainty in rooting positions. Conclusions: RootDigger can be used on an existing phylogeny to find a root, or to assess the uncertainty of the root placement

    A Rapid Bootstrap Algorithm for the RAxML Web Servers

    Despite recent advances achieved by application of high-performance computing methods and novel algorithmic techniques to maximum likelihood (ML)-based inference programs, the major computational bottleneck still consists in the computation of bootstrap support values. Conducting a probably insufficient number of 100 bootstrap (BS) analyses with current ML programs on large datasets—either with respect to the number of taxa or base pairs—can easily require a month of run time. Therefore, we have developed, implemented, and thoroughly tested rapid bootstrap heuristics in RAxML (Randomized Axelerated Maximum Likelihood) that are more than an order of magnitude faster than current algorithms. These new heuristics can contribute to resolving the computational bottleneck and improve current methodology in phylogenetic analyses. Computational experiments to assess the performance and relative accuracy of these heuristics were conducted on 22 diverse DNA and AA (amino acid), single gene as well as multigene, real-world alignments containing 125 up to 7764 sequences. The standard BS (SBS) and rapid BS (RBS) values drawn on the best-scoring ML tree are highly correlated and show almost identical average support values. The weighted RF (Robinson-Foulds) distance between SBS- and RBS-based consensus trees was smaller than 6% in all cases (average 4%). More importantly, RBS inferences are between 8 and 20 times faster (average 14.73) than SBS analyses with RAxML and between 18 and 495 times faster than BS analyses with competing programs, such as PHYML or GARLI. Moreover, this performance improvement increases with alignment size. Finally, we have set up two freely accessible Web servers for this significantly improved version of RAxML that provide access to the 200-CPU cluster of the Vital-IT unit at the Swiss Institute of Bioinformatics and the 128-CPU cluster of the CIPRES project at the San Diego Supercomputer Center. 
These Web servers offer the possibility to conduct large-scale phylogenetic inferences to a large part of the community that does not have access to, or the expertise to use, high-performance computing resources
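
The abstract compares consensus trees using a weighted Robinson-Foulds (RF) distance. As a rough illustration of the underlying idea only, the following Python sketch computes the plain, unweighted RF distance, which counts the non-trivial bipartitions found in exactly one of two trees. The nested-tuple tree encoding, the example trees and the function names are assumptions for illustration; the weighting scheme used in the paper is not reproduced here.

```python
def bipartitions(tree, all_taxa):
    """Return the non-trivial splits of a tree given as nested tuples."""
    splits = set()

    def collect(node):
        if isinstance(node, str):               # a leaf is a taxon name
            return frozenset([node])
        below = frozenset().union(*(collect(child) for child in node))
        other = all_taxa - below
        if len(below) > 1 and len(other) > 1:   # skip trivial splits
            splits.add(frozenset([below, other]))
        return below

    collect(tree)
    return splits


def rf_distance(tree_a, tree_b, taxa):
    """Unweighted RF distance: splits present in exactly one tree."""
    taxa = frozenset(taxa)
    return len(bipartitions(tree_a, taxa) ^ bipartitions(tree_b, taxa))


def relative_rf(tree_a, tree_b, taxa):
    """RF distance divided by its maximum 2(n - 3) for n taxa."""
    n = len(frozenset(taxa))
    return rf_distance(tree_a, tree_b, taxa) / (2 * (n - 3))


# Two made-up five-taxon trees that disagree on one inner branch.
tree_1 = (("A", "B"), "C", ("D", "E"))
tree_2 = (("A", "C"), "B", ("D", "E"))
print(rf_distance(tree_1, tree_2, "ABCDE"), relative_rf(tree_1, tree_2, "ABCDE"))
```

The relative form divides by 2(n - 3), the largest possible distance between two fully resolved unrooted trees on n taxa, which is how RF distances are commonly reported as percentages.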

    Prediction of missing sequences and branch lengths in phylogenomic data

    This is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record, Diego Darriba, Michael Weiß, Alexandros Stamatakis; Prediction of missing sequences and branch lengths in phylogenomic data, Bioinformatics, Volume 32, Issue 9, 1 May 2016, Pages 1331–1337, is available online at: https://doi.org/10.1093/bioinformatics/btv768 [Abstract] Motivation: The presence of missing data in large-scale phylogenomic datasets has negative effects on the phylogenetic inference process. One effect that is caused by alignments with missing per-gene or per-partition sequences is that the inferred phylogenies may exhibit extremely long branch lengths. We investigate whether statistically predicting missing sequences for organisms by using information from genes/partitions that have data for these organisms alleviates the problem and improves phylogenetic accuracy. Results: We present several algorithms for correcting excessively long branch lengths induced by missing data. We also present methods for predicting/imputing missing sequence data. We evaluate our algorithms by systematically removing sequence data from three empirical and 100 simulated alignments. We then compare the Maximum Likelihood trees inferred from the gappy alignments and from the alignments with predicted sequence data to the trees inferred from the original, complete datasets. The datasets with predicted sequences showed one to two orders of magnitude more accurate branch lengths compared to the branch lengths of the trees inferred from the alignments with missing data. However, prediction did not affect the RF distances between the trees

    Is the Protein Model Assignment problem under linked branch lengths NP-hard?

    Abstract: In phylogenetics, computing the likelihood that a given tree generated the observed sequence data requires calculating the probability of the available data for a given tree (topology and branch lengths) under a statistical model of sequence evolution. Here, we focus on selecting an appropriate model for the data, which represents a generally non-trivial task. The data is represented as a so-called multiple sequence alignment. That is, each individual sequence of any one species (taxa) is arranged (aligned) in such a way that the characters of all species at a given position (site) are assumed to share a common evolutionary history. It is well known that an inappropriate model, which does not fit the data, can generate misleading tree topologies [3,4,26]. More specifically, we consider the case of partitioned protein sequence alignments. This means that the sites of the alignment may be clustered together into different partitions. Each partition may have an individual model of evolution. Our objective is to maximize the likelihood of the per-partition protein model assignments (e.g., JTT, WAG, etc.) when branches are linked across partitions on a given, fixed tree topology. That is, branch lengths are not estimated individually for each partition. Linked branch lengths across partitions substantially reduce the number of free parameters. For p partitions and |M| possible substitution models, there are |M|^p possible model assignments. Since the number of combinations grows exponentially with p, an exhaustive search for the highest scoring assignment is computationally prohibitive for |M|>1. We show that the problem of finding the optimal protein substitution model assignment under linked branch lengths on a given tree topology is NP-hard. Our results imply that one should employ heuristics to approximate the solution, instead of striving for the exact solution. Alternatively, the problem can be simplified by relaxing the assumptions
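
To make the |M|^p growth concrete, here is a small Python sketch of the exhaustive search that the abstract describes as prohibitive. The model list, the toy scoring table and the function names are invented for illustration. In a real analysis each assignment would have to be scored by re-optimising the shared branch lengths, which is what prevents the partitions from being treated independently.

```python
from itertools import product

MODELS = ["JTT", "WAG", "LG"]   # illustrative subset of protein models


def count_assignments(num_models, num_partitions):
    """Number of per-partition model assignments, |M|^p."""
    return num_models ** num_partitions


def best_assignment(models, num_partitions, score):
    """Brute force over all |M|^p assignments; only feasible for tiny p."""
    return max(product(models, repeat=num_partitions), key=score)


def toy_score(assignment):
    # Made-up log-likelihood stand-in with two partitions. It is separable
    # per partition, which a real linked-branch-length score would NOT be.
    per_partition = [{"JTT": -120.0, "WAG": -118.5, "LG": -119.0},
                     {"JTT": -95.0, "WAG": -97.0, "LG": -96.5}]
    return sum(table[m] for table, m in zip(per_partition, assignment))


print(count_assignments(len(MODELS), 2))       # size of the toy search space
print(best_assignment(MODELS, 2, toy_score))   # best assignment for the toy score
print(count_assignments(20, 50))               # 20 models, 50 partitions
```

Even with only 20 candidate models and 50 partitions the search space has 20^50 entries, which is why the abstract recommends heuristics rather than exact search.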

    Lagrange-NG: The next generation of Lagrange

    Computing ancestral ranges via the Dispersion Extinction and Cladogenesis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. Therefore, the kinds of biogeographical analyses that can be conducted under the DEC model are limited by the number of regions under consideration. In this work, we present a completely redesigned efficient version of the popular tool Lagrange which is up to 49 times faster with multithreading enabled, and is also 26 times faster when using only one thread. We call this new version Lagrange-NG (Lagrange-Next Generation). The increased computational efficiency allows Lagrange-NG to analyze datasets with a large number of regions in a reasonable amount of time, up to 12 regions in approximately 18 min. We achieve these speedups using a relatively new method of computing the matrix exponential based on Krylov subspaces. In order to validate the correctness of Lagrange-NG, we also introduce a novel metric on range distributions for trees so that researchers can assess the difference between any two range inferences. Finally, Lagrange-NG exhibits substantially higher adherence to coding quality standards. It improves a respective software quality indicator as implemented in the SoftWipe tool from average (5.5; Lagrange) to high (7.8; Lagrange-NG). Lagrange-NG is freely available under GPL2. [Biogeography; Phylogenetics; DEC Model.]
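
The exponential state space mentioned in this abstract can be sketched with a few lines of Python. The sketch assumes that every state is a presence/absence pattern over the regions, giving 2^n states for n regions. Whether a given tool counts the empty range or caps the range size is not stated in the abstract, and the function names are hypothetical.

```python
def dec_state_count(num_regions):
    """States under the assumption that each state is a subset of regions."""
    return 2 ** num_regions


def dense_matrix_bytes(num_regions, bytes_per_entry=8):
    """Memory for one dense square rate matrix of double-precision entries."""
    n = dec_state_count(num_regions)
    return n * n * bytes_per_entry


for regions in (4, 8, 12):
    states = dec_state_count(regions)
    mib = dense_matrix_bytes(regions) / 2 ** 20
    print(f"{regions} regions: {states} states, {mib:.3f} MiB per dense matrix")
```

Under these assumptions, 12 regions already give 4096 states and a dense matrix of 128 MiB, which suggests why the cost of the matrix exponential dominates as regions are added.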